Construction of minimal DFAs from biological motifs
Deterministic finite automata (DFAs) are constructed for various purposes in
computational biology. Little attention, however, has been given to the
efficient construction of minimal DFAs. In this article, we define simple
non-deterministic finite automata (NFAs) and prove that the standard subset
construction transforms NFAs of this type into minimal DFAs. Furthermore, we
show how simple NFAs can be constructed from two types of patterns popular in
bioinformatics, namely (sets of) generalized strings and (generalized) strings
with a Hamming neighborhood.
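Whether the subset construction yields a *minimal* DFA depends on the NFA being "simple" in the article's sense; the following sketch only illustrates the standard subset construction itself, applied to a hypothetical search NFA for the generalized string [AC]G:

```python
def subset_construction(delta, start, accept, alphabet):
    """Standard subset construction: turns an NFA (transition function
    delta: (state, symbol) -> set of states) into a DFA whose states
    are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        S = todo.pop()
        for a in alphabet:
            T = frozenset(q for s in S for q in delta.get((s, a), ()))
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    accepting = {S for S in seen if S & accept}
    return dfa, start_set, accepting

# NFA for searching the generalized string [AC]G in a text:
# state 0 scans the text (self-loop on every symbol), state 2 accepts.
alphabet = "ACGT"
delta = {(0, a): {0} for a in alphabet}
delta[(0, "A")] = {0, 1}
delta[(0, "C")] = {0, 1}
delta[(1, "G")] = {2}

dfa, q0, accepting = subset_construction(delta, 0, {2}, alphabet)
# The resulting DFA has 3 states: {0}, {0,1}, and the accepting {0,2}.
```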
Sensitive Long-Indel-Aware Alignment of Sequencing Reads
The tremendous advances in high-throughput sequencing technologies have made
population-scale sequencing as performed in the 1000 Genomes project and the
Genome of the Netherlands project possible. Next-generation sequencing has
allowed genome-wide discovery of variations beyond single-nucleotide
polymorphisms (SNPs), in particular of structural variations (SVs) like
deletions, insertions, duplications, translocations, inversions, and even more
complex rearrangements. Here, we design a read aligner with special emphasis on
the following properties: (1) high sensitivity, i.e. find all (reasonable)
alignments; (2) ability to find (long) indels; (3) statistically sound
alignment scores; and (4) runtime fast enough to be applied to whole genome
data. We compare performance to BWA, Bowtie2, and Stampy, and find that our
method is especially advantageous on reads containing larger indels.
Next Generation Cluster Editing
This work aims at improving the quality of structural variant prediction from
the mapped reads of a sequenced genome. We suggest a new model based on cluster
editing in weighted graphs and introduce a new heuristic algorithm that solves
this problem quickly and with good approximation quality on the huge graphs
that arise from biological datasets.
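The abstract does not spell out the heuristic; as a point of reference only, a minimal pivot-style heuristic for weighted cluster editing (in the spirit of randomized pivot algorithms for correlation clustering, not this paper's actual method) can be sketched as follows:

```python
def pivot_clusters(vertices, weight):
    """Pivot-style heuristic for weighted cluster editing: repeatedly
    pick an unclustered vertex and cluster it together with every
    unclustered vertex joined to it by a positive-weight edge.
    weight(u, v) > 0 encodes evidence that u and v belong together."""
    unclustered = set(vertices)
    clusters = []
    while unclustered:
        pivot = min(unclustered)  # deterministic choice here; random in practice
        cluster = {pivot} | {v for v in unclustered
                             if v != pivot and weight(pivot, v) > 0}
        clusters.append(cluster)
        unclustered -= cluster
    return clusters

# Toy weighted graph: vertices 1-3 strongly linked, 4 and 5 linked.
edges = {frozenset((1, 2)): 5.0, frozenset((1, 3)): 4.0,
         frozenset((2, 3)): 3.0, frozenset((4, 5)): 2.0}
weight = lambda u, v: edges.get(frozenset((u, v)), -1.0)
print(pivot_clusters(range(1, 6), weight))  # [{1, 2, 3}, {4, 5}]
```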
Constructing Founder Sets Under Allelic and Non-Allelic Homologous Recombination
Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements - including deletion, duplication, and inversion - and underlies many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR.
In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where human haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, and demonstrate scalability to problem instances arising in practice.
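The core graph representation can be illustrated with a toy example (node labels are hypothetical): each haplotype is a walk through the graph, and an allelic recombination corresponds to switching from one walk to another at a shared node. NAHR would involve switching at two different copies of a repeated node, which this toy sketch does not distinguish:

```python
def recombine(walk_a, walk_b, node):
    """Model a crossover: follow walk_a up to a node shared with
    walk_b, then continue along walk_b from that node."""
    return walk_a[:walk_a.index(node)] + walk_b[walk_b.index(node):]

# Two parental haplotypes as walks through a toy sequence graph.
maternal = ["s", "a1", "b", "c1", "t"]
paternal = ["s", "a2", "b", "c2", "t"]
child = recombine(maternal, paternal, "b")
print(child)  # ['s', 'a1', 'b', 'c2', 't']
```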
SNP and indel frequencies at transcription start sites and at canonical and alternative translation initiation sites in the human genome
Single-nucleotide polymorphisms (SNPs) are the most common form of genetic variation in humans and drive phenotypic variation. Due to evolutionary conservation, SNPs and indels (insertions and deletions) are depleted in functionally important sequence elements. Recently, population-scale sequencing efforts such as the 1000 Genomes Project and the Genome of the Netherlands Project have catalogued large numbers of sequence variants. Here, we present a systematic analysis of the polymorphisms reported by these two projects in different coding and non-coding genomic elements of the human genome (intergenic regions, CpG islands, promoters, 5' UTRs, coding exons, 3' UTRs, introns, and intragenic regions). We were especially interested in the distribution of SNPs and indels in the direct vicinity of the transcription start site (TSS) and the translation start site (CSS). We discovered an enrichment of CpG and CpA dinucleotides and an accumulation of SNPs at base position -1 relative to the TSS, involving primarily these dinucleotides. Genes having a CpG dinucleotide at TSS position -1 were enriched in the functional GO terms "Phosphoprotein", "Alternative splicing", and "Protein binding". Focusing on the CSS, we compared SNP patterns in the flanking regions of canonical and alternative AUG and near-cognate start sites, where we considered alternative starts previously identified by experimental ribosome profiling. We observed similar conservation patterns at canonical and alternative translation start sites, which underlines the importance of alternative translation mechanisms for cellular function.
An Algorithm to Compute the Character Access Count Distribution for Pattern Matching Algorithms
We propose a framework for the exact probabilistic
analysis of window-based pattern matching algorithms, such as
Boyer--Moore, Horspool, Backward DAWG Matching, Backward Oracle
Matching, and more. In particular, we develop an algorithm that
efficiently computes the distribution of a pattern matching
algorithm's running time cost (such as the number of text character
accesses) for any given pattern in a random text model. Text models
range from simple uniform models to higher-order Markov models or
hidden Markov models (HMMs). Furthermore, we provide an algorithm to
compute the exact distribution of \emph{differences} in running time
cost of two pattern matching algorithms. Methodologically, we use
extensions of finite automata which we call \emph{deterministic
arithmetic automata} (DAAs) and \emph{probabilistic arithmetic
automata} (PAAs)~\cite{Marschall2008}. Given an algorithm, a
pattern, and a text model, a PAA is constructed from which the
sought distributions can be derived using dynamic programming. To
our knowledge, this is the first time that substring- or
suffix-based pattern matching algorithms are analyzed exactly by
computing the whole distribution of running time cost.
Experimentally, we compare Horspool's algorithm, Backward DAWG
Matching, and Backward Oracle Matching on prototypical patterns of
short length and provide statistics on the size of minimal DAAs for
these computations.
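As a brute-force cross-check of what such an analysis computes, the exact distribution of Horspool's text-character accesses under an i.i.d. uniform text model can be obtained for tiny instances by enumerating all texts. The PAA/DAA machinery of the paper computes this without enumeration; the access-counting convention below (one access per compared window position) is an assumption of this sketch:

```python
from collections import Counter
from itertools import product

def horspool_accesses(pattern, text):
    """Horspool search; returns the number of text character accesses,
    counting one access per compared window position (a convention)."""
    m = len(pattern)
    shift = {c: m for c in set(text) | set(pattern)}
    for i, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - i
    accesses, pos = 0, 0
    while pos + m <= len(text):
        for j in range(m - 1, -1, -1):   # compare window right to left
            accesses += 1
            if text[pos + j] != pattern[j]:
                break
        pos += shift[text[pos + m - 1]]  # shift by the window's last char
    return accesses

def access_distribution(pattern, alphabet, n):
    """Exact access-count distribution over all |alphabet|**n texts,
    i.e. under the i.i.d. uniform text model (brute force)."""
    p = 1.0 / len(alphabet) ** n
    dist = Counter()
    for chars in product(alphabet, repeat=n):
        dist[horspool_accesses(pattern, "".join(chars))] += p
    return dict(dist)

dist = access_distribution("ab", "ab", 4)
```

The probabilities in `dist` sum to one, and the brute-force enumeration is exponential in the text length, which is exactly the cost the paper's dynamic-programming construction avoids.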
CLEVER: Clique-Enumerating Variant Finder
Next-generation sequencing techniques have facilitated a large-scale analysis
of human genetic variation. Despite the advances in sequencing speeds, the
computational discovery of structural variants is not yet standard. It is
likely that many variants have remained undiscovered in most sequenced
individuals. Here we present a novel approach based on internal segment sizes,
which organizes all reads, including concordant ones, into a read alignment
graph where max-cliques represent maximal contradiction-free groups of
alignments. A specifically engineered algorithm then enumerates all max-cliques
and statistically evaluates them for their potential to reflect insertions or
deletions (indels). For the first time in the literature, we compare a large
range of state-of-the-art approaches using simulated Illumina reads from a
fully annotated genome and present various relevant performance statistics. We
achieve superior performance rates in particular on indels of sizes 20--100,
which have been exposed as a current major challenge in the SV discovery
literature and where prior insert-size-based approaches have limitations. In
that size range, we outperform even split-read aligners. We also achieve good
results on real data where, as the only tool, we make a substantial number of
correct predictions that complement those of split-read aligners. CLEVER is
open source (GPL) and available from
http://clever-sv.googlecode.com.
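Clique enumeration is tractable here because the alignment graph behaves like an interval graph: each read pair supports an interval of plausible indel lengths, and alignments are compatible when their intervals overlap. A minimal sweep-line sketch of maximal-clique enumeration for intervals (an illustration of the idea, not CLEVER's actual implementation):

```python
def maximal_cliques(intervals):
    """Enumerate maximal cliques of an interval graph by a sweep line.
    intervals: list of (lo, hi) closed intervals; vertices are indices.
    A maximal clique is the active set just before a right endpoint
    that follows at least one left endpoint."""
    events = []
    for idx, (lo, hi) in enumerate(intervals):
        events.append((lo, 0, idx))  # 0 = open; sorts before a close at the same point
        events.append((hi, 1, idx))  # 1 = close
    events.sort()
    active, cliques, last_was_open = set(), [], False
    for _pos, kind, idx in events:
        if kind == 0:
            active.add(idx)
            last_was_open = True
        else:
            if last_was_open:        # active set cannot grow further: maximal
                cliques.append(frozenset(active))
            active.remove(idx)
            last_was_open = False
    return cliques

# Three alignments whose implied indel-length intervals overlap pairwise:
print(maximal_cliques([(1, 5), (3, 8), (7, 10)]))
# [frozenset({0, 1}), frozenset({1, 2})]
```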
Mixed-order Ambisonics recording and playback for improving horizontal directionality
Planar (2D) and periphonic (3D) higher-order Ambisonics (HOA) systems are widely used to reproduce spatial properties of acoustic scenarios. Mixed-order Ambisonics (MOA) systems combine the benefit of higher-order 2D systems, i.e. a high spatial resolution over a larger usable frequency bandwidth, with a lower-order 3D system to reproduce elevated sound sources. In order to record MOA signals, the locations of the microphones on a hard sphere were optimized to provide a robust MOA encoding. A detailed analysis of the encoding and decoding process showed that MOA can improve both the spatial resolution in the horizontal plane and the usable frequency bandwidth for playback as well as recording. Hence the described MOA scheme provides a promising method for improving the performance of current 3D sound reproduction systems.